15 February 2018
Determine the need through collaboration
Articulate the need as a question
Scope out the 'minimum viable product'
Allow time for peer review
Communicate the findings in appropriate ways
Data, analysis outputs, and visualisation are stored in one place
Data is easily manipulated - high risk of introducing errors
Point-and-click interface - manual repetition
Actions are not logged
Slow when working with large data sets or files storing many computations
Complex Excel workbooks can 'break' when shared with others
Separation of data, analysis outputs, visualisations, and reports
Reproducible, transparent, and less error-prone workflow
Much faster for repetitive actions than point-and-click
Requires you to think early and often about data structure
Procedures are documented via scripts and command history
Build up a library of repeatable, adaptable procedures
R is a statistical programming language, RStudio is an IDE for using R
Its most fundamental component is 'base R' but it is highly extensible
R is free and open source: users benefit from packages created by others
The R community is global and active - lots of help for when you get stuck!
Demo 1 uses the winter data set downloadable from this link.
It includes all podium results from Winter Olympic Games that have taken place from 1924 to 2014, inclusive.
# The readr package is efficient for reading (importing) of CSV files
library(readr)
# Read the data in
winter <- read_csv("winter.csv", col_names = TRUE, col_types = NULL)
head(winter)
## # A tibble: 6 x 9 ## Year City Sport Discipline Athlete Country Gender Event Medal ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1924 Chamonix Biathlon Biathlon BERTHET,~ FRA Men Milit~ Bron~ ## 2 1924 Chamonix Biathlon Biathlon MANDRILL~ FRA Men Milit~ Bron~ ## 3 1924 Chamonix Biathlon Biathlon MANDRILL~ FRA Men Milit~ Bron~ ## 4 1924 Chamonix Biathlon Biathlon VANDELLE~ FRA Men Milit~ Bron~ ## 5 1924 Chamonix Biathlon Biathlon AUFDENBL~ SUI Men Milit~ Gold ## 6 1924 Chamonix Biathlon Biathlon JULEN, A~ SUI Men Milit~ Gold
Let's say we are interested in comparing the number of gold medals won by Canada, Norway, and Sweden from the last five Winter Olympic Games.
We need to subset the data to focus only on:
# The dplyr package includes useful data manipulation functions
library(dplyr)
# Subset the data
goldMedalComparison <- winter %>%
filter(Medal == "Gold"
& Country %in% c("CAN", "NOR", "SWE")
& Year >= 1998)
goldMedalComparison
## # A tibble: 358 x 9 ## Year City Sport Discipline Athlete Country Gender Event Medal ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1998 Nagano Biathlon Biathlon BJOERND~ NOR Men 10KM Gold ## 2 1998 Nagano Biathlon Biathlon HANEVOL~ NOR Men 20KM Gold ## 3 1998 Nagano Bobsleigh Bobsleigh LUEDERS~ CAN Men Two-~ Gold ## 4 1998 Nagano Bobsleigh Bobsleigh MACEACH~ CAN Men Two-~ Gold ## 5 1998 Nagano Curling Curling BETKER,~ CAN Women Curl~ Gold ## 6 1998 Nagano Curling Curling FORD, A~ CAN Women Curl~ Gold ## 7 1998 Nagano Curling Curling GUDEREI~ CAN Women Curl~ Gold ## 8 1998 Nagano Curling Curling MCCUSKE~ CAN Women Curl~ Gold ## 9 1998 Nagano Curling Curling SCHMIRL~ CAN Women Curl~ Gold ## 10 1998 Nagano Skating Short Track~ BEDARD,~ CAN Men 5000~ Gold ## # ... with 348 more rows
In the winter data set, the data is structured such that one row is one medal-winning athlete.
However…
For this analysis, we need to wrangle the data to get it into a format where one row represents one gold medal per event (rather than per athlete).
# Create a new variable to identify each unique event
goldMedalComparison$uniqueEventID <-
paste(goldMedalComparison$Year, goldMedalComparison$Sport,
goldMedalComparison$Discipline, goldMedalComparison$Country,
goldMedalComparison$Gender, goldMedalComparison$Event, sep="_")
goldMedalComparison %>% select(uniqueEventID)
## # A tibble: 358 x 1 ## uniqueEventID ## <chr> ## 1 1998_Biathlon_Biathlon_NOR_Men_10KM ## 2 1998_Biathlon_Biathlon_NOR_Men_20KM ## 3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man ## 4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man ## 5 1998_Curling_Curling_CAN_Women_Curling ## 6 1998_Curling_Curling_CAN_Women_Curling ## 7 1998_Curling_Curling_CAN_Women_Curling ## 8 1998_Curling_Curling_CAN_Women_Curling ## 9 1998_Curling_Curling_CAN_Women_Curling ## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay ## # ... with 348 more rows
# Identify duplicates goldMedalComparison$duplicates <- duplicated(goldMedalComparison$uniqueEventID) goldMedalComparison %>% select(uniqueEventID, duplicates)
## # A tibble: 358 x 2 ## uniqueEventID duplicates ## <chr> <lgl> ## 1 1998_Biathlon_Biathlon_NOR_Men_10KM F ## 2 1998_Biathlon_Biathlon_NOR_Men_20KM F ## 3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man F ## 4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man T ## 5 1998_Curling_Curling_CAN_Women_Curling F ## 6 1998_Curling_Curling_CAN_Women_Curling T ## 7 1998_Curling_Curling_CAN_Women_Curling T ## 8 1998_Curling_Curling_CAN_Women_Curling T ## 9 1998_Curling_Curling_CAN_Women_Curling T ## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay F ## # ... with 348 more rows
# Remove duplicates from the data # and remove temporary variables used for identifying duplicates goldMedalComparison <- goldMedalComparison %>% filter(duplicates == FALSE) %>% select(-uniqueEventID, -duplicates)
goldMedalComparison %>% select(Year, Sport, Country, Event)
## # A tibble: 105 x 4 ## Year Sport Country Event ## <int> <chr> <chr> <chr> ## 1 1998 Biathlon NOR 10KM ## 2 1998 Biathlon NOR 20KM ## 3 1998 Bobsleigh CAN Two-Man ## 4 1998 Curling CAN Curling ## 5 1998 Skating CAN 5000M Relay ## 6 1998 Skating CAN 500M ## 7 1998 Skating NOR 1500M ## 8 1998 Skating CAN 500M ## 9 1998 Skiing NOR Slalom ## 10 1998 Skiing NOR 10KM Pursuit ## # ... with 95 more rows
The data is now in a format that is ready for calculating summary statistics.
For instance, we can tabulate the total number of gold medals won by Canada, Norway, and Sweden at each of the Winter Games since 1998.
# The reshape2 package is useful for manipulating data
# from long-to-wide format and vice versa
library(reshape2)
# Calculate total number of gold medals per team per Winter Games
goldMedalTotals <- goldMedalComparison %>%
group_by(Year, City, Country) %>%
summarise(goldMedalTotal = length(Medal)) %>%
ungroup()
# Reshape the data from long-to-wide format for tabulating
goldMedalTotalsWide <- dcast(goldMedalTotals,
Year + City ~ Country,
value.var = "goldMedalTotal")
# Check the reshaped data goldMedalTotalsWide
## Year City CAN NOR SWE ## 1 1998 Nagano 6 10 NA ## 2 2002 Salt Lake City 8 12 NA ## 3 2006 Turin 7 2 7 ## 4 2010 Vancouver 15 9 5 ## 5 2014 Sochi 10 12 2
# NAs appear in the data because Sweden did not win any gold medals # at the 1998 and 2002 Winter Games # Replace these NAs with zeroes goldMedalTotalsWide$SWE[is.na(goldMedalTotalsWide$SWE)] <- 0
# The DT package is used to display interactive tables in HTML documents library(DT) # Display the table datatable(goldMedalTotalsWide, options = list(dom = 't'), rownames = FALSE)
We can use the data frame we've just created - gold medal totals per team per Games - to calculate the median number of gold medals won by Canada, Norway, and Sweden at the last five Games.
goldMedalMedians <- goldMedalTotalsWide %>%
summarise(CAN = median(CAN),
NOR = median(NOR),
SWE = median(SWE))
goldMedalMedians
## CAN NOR SWE ## 1 8 10 2
For demo 2, we will use data from season 1 of the AFLW to show how sports data can be visualised using R.
# This data was obtained from this Github repository:
# https://github.com/ropenscilabs/ozwomensport/tree/master/AFLW
aflwTeamStats <- read_csv("teams.csv", col_names = TRUE, col_types = NULL)
head(aflwTeamStats)
## # A tibble: 6 x 49 ## Club Kicks_TOT Kicks_AVG Handballs_TOT Handballs_AVG Disposals_TOT ## <chr> <int> <dbl> <int> <dbl> <int> ## 1 Adelaide ~ 1052 132 458 57.3 1510 ## 2 All Clubs 6840 118 3478 60.2 10318 ## 3 Brisbane ~ 977 122 416 52.0 1393 ## 4 Carlton 780 111 424 60.6 1204 ## 5 Collingwo~ 838 120 335 47.9 1173 ## 6 Fremantle 818 117 409 58.4 1227 ## # ... with 43 more variables: Disposals_AVG <dbl>, `Cont Poss_TOT` <int>, ## # `Cont Poss_AVG` <dbl>, `Uncont Poss_TOT` <int>, `Uncont ## # Poss_AVG` <dbl>, `Disp eff %_TOT` <dbl>, Clangers_TOT <int>, ## # Clangers_AVG <dbl>, Marks_TOT <int>, Marks_AVG <dbl>, `Cont ## # marks_TOT` <int>, `Cont marks_AVG` <dbl>, Marks50_TOT <int>, ## # Marks50_AVG <dbl>, `Hit-outs_TOT` <int>, `Hit-outs_AVG` <dbl>, ## # Clearances_TOT <int>, Clearances_AVG <dbl>, `Centre clr_TOT` <int>, ## # `Centre clr_AVG` <dbl>, Stoppages_TOT <int>, Stoppages_AVG <dbl>, ## # Rebound50_TOT <int>, Rebound50_AVG <dbl>, `Frees For_TOT` <int>, ## # `Frees For_AVG` <dbl>, `Frees Agst_TOT` <int>, `Frees Agst_AVG` <dbl>, ## # `One %s_TOT` <int>, `One %s_AVG` <dbl>, Bounces_TOT <int>, ## # Bounces_AVG <dbl>, Inside50_TOT <int>, Inside50_AVG <dbl>, ## # Behinds_TOT <int>, Behinds_AVG <dbl>, `Goal assists_TOT` <int>, `Goal ## # assists_AVG` <dbl>, `Goal acc %_TOT` <dbl>, `AFL Fantasy_TOT` <int>, ## # `AFL Fantasy_AVG` <dbl>, Interchanges_TOT <int>, ## # Interchanges_AVG <dbl>
Let's compare some basic stats across clubs. First, we need to prepare the data.
aflwTeamStats <- aflwTeamStats %>% filter(Club != "All Clubs")
# The ggplot2 package is widely used for plotting in R
# It uses a layer-by-layer approach to building plots
library(ggplot2)
# Sort by average kicks per game
aflwTeamStats <- aflwTeamStats[order(aflwTeamStats$Kicks_AVG), ]
# Build the plot
p <- ggplot(aflwTeamStats, aes(x = reorder(Club, -Kicks_AVG), y = Kicks_AVG, fill = Club))
p <- p + geom_bar(stat = "identity")
p <- p + scale_y_continuous(limits = c(0,150),
breaks = c(0,25,50,75,100,125,150))
p <- p + labs(title = "Comparing average kicks per game across AFLW teams",
x = "Club",
y = "Average kicks per game")
p <- p + theme(legend.position = "none",
axis.text.x = element_text(angle=90, vjust=0.5))
aflwKicksPlot <- p
This presentation is not a deep dive into statistics, but R is more than capable of running a vast array of stats including:
Try R: An in-the-browser tutorial for getting to know the R language - no software installation needed.
'Introduction to R': Free course on Data Camp.
'R for Data Science', by Garrett Grolemund & Hadley Wickham.
'Data wrangling, exploration, and analysis with R': Course materials from a unit taught at the University of British Columbia by Prof Jenny Bryan.
'R for data analysis and visualisation': Data Carpentry lesson using ecology data
The Data Science Guide: A useful resource for finding content all over the web for learning different data science skills and programming languages. You can tailor your search to your current level of expertise in a given domain.
'R for Excel Users': Blog post by Gordon Shotwell
'Introduction to R for Excel Users': Free ebook by Tom Hopper
Follow the #rstats hashtag on Twitter
Many prominent R users and developers have blogs and / or tweet regularly. Here are some of my favourites:
@JennyBryan | @ma_salmon | @alice_data | @dataandme | @visnut | @kierisi | @thomasp85 | @xieyihui | @drob | @juliasilge
Join, read, and contribute to the R Studio Community discussions: https://community.rstudio.com/
You can discover lots of weird, wonderful, and useful R packages and repositories on Github: https://github.com/trending/r
AFLW data stored on Github
ATP and WTA tennis data, accessible via the R package deuce
Olympic Sports and Medals, 1896-2014 collated by The Guardian
NBA, NFL, MLB, and soccer data from FiveThirtyEight
NFL play-by-play data, accessible via the R package nflscrapR
MLB data from Lahman's Baseball Database
Cricket data, accessible via the R package cricinfo
WNBA player and game data from WNBA.com
AFL match statistics from AFL Tables
AFL draft and trade data from Draft Guru